Add support for HCA tissue atlas (#7128) #7877
a5fa3db to 3fb45f1
3fb45f1 to 95a800d
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #7877 +/- ##
===========================================
+ Coverage 84.73% 84.77% +0.04%
===========================================
Files 165 165
Lines 24074 24138 +64
===========================================
+ Hits 20400 20464 +64
Misses 3674 3674
080a58a to da4787e
1798785 to c0b9465
achave11-ucsc left a comment
Looking good.
Consider adding the testing infrastructure in an initial (possibly separate) commit; it gives a stronger signal. Also add a test case for a None namespace in the Nested.to_index/from_index round-trip.
| path = dotted(facet_path, 'keyword') | ||
| nested_agg.bucket(name='myTerms', | ||
| agg_type='terms', | ||
| field=path, | ||
| size=config.terms_aggregation_size) | ||
| nested_agg.bucket('untagged', 'missing', field=path) |
There was a problem hiding this comment.
Since this is the same type of invocation as the one below, isn't a FIXME in order?
Additionally, you should consider not duplicating the section and just having the one FIXME, as it is now.
There was a problem hiding this comment.
The FIXME is for a closed/invalid issue (#3413). Removed.
| bucket['key'] = field_type.from_index(bucket['key']) | ||
| translate(k, bucket) | ||
| try: | ||
| # From a MultiTerms aggregation |
There was a problem hiding this comment.
Consider elaborating on this comment (and perhaps also the one in L423) as to why the MultiTerms aggregation isn't recursive.
Lastly, this comment is misplaced. It should be part of the commit which introduced the change.
| try: | ||
| paths = v['meta']['paths'] | ||
| except KeyError: | ||
| pass | ||
| else: | ||
| for i, path in enumerate(paths): | ||
| field_type = self.service.field_type(self.catalog, tuple(path)) | ||
| for bucket in buckets: | ||
| bucket['key'][i] = field_type.from_index(bucket['key'][i]) |
There was a problem hiding this comment.
I don't think this is semantically neutral. Consider adding this as a subsequent commit or squashing it with the main changes.
Also, consider elaborating on the "Refactor method" commit message by adding some context, if you do decide to keep it.
There was a problem hiding this comment.
This section should have been in the main commit, not the refactoring commit. Moved.
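For readers following this thread, the key translation in the snippet above can be sketched in isolation. This is only an illustration, not the code in the PR: `translate_bucket_keys`, the `converters` mapping and the `'__null__'` placeholder are hypothetical stand-ins for `self.service.field_type(...)` and `FieldType.from_index()`. It assumes that each bucket key of a multi-terms aggregation is a list with one element per entry in `meta['paths']`.

```python
from typing import Any, Callable

Path = tuple[str, ...]
Converter = Callable[[Any], Any]


def translate_bucket_keys(agg: dict[str, Any],
                          converters: dict[Path, Converter]) -> None:
    """
    Convert every element of every bucket key, in place, using the
    converter registered for the path at the same position.
    """
    try:
        paths = agg['meta']['paths']
    except KeyError:
        # No paths were recorded for this aggregation, leave the keys as is
        return
    for i, path in enumerate(paths):
        from_index = converters[tuple(path)]
        for bucket in agg['buckets']:
            bucket['key'][i] = from_index(bucket['key'][i])


def null_to_none(value: Any) -> Any:
    # Hypothetical decoding: the index stores a placeholder string for None
    return None if value == '__null__' else value
```

Because the position in `paths` selects the converter, each property of the nested field can have its own type without the translation having to recurse into the bucket.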
c0b9465 to d67c292
5b5526f to 0910a5e
| is a flat mapping and includes the synthetic field 'accessible' that has | ||
| no entry in the plugin's field_mapping. | ||
|
|
||
| :return: dict with field names as keys and each field's type as value |
There was a problem hiding this comment.
| - :return: dict with field names as keys and each field's type as value |
| + :return: a mapping from each field's name to its type |
| result = {} | ||
| for field, path in plugin.field_mapping.items(): | ||
| field_type = self.field_type(catalog, path) | ||
| if isinstance(field_type, FieldType): |
There was a problem hiding this comment.
TODO: investigate in which case field_type is not of type FieldType
There was a problem hiding this comment.
The value returned by DocumentService.field_type() is always a FieldType, and there is already an assert in that method, so the check here is unnecessary. I will remove it.
| ) | ||
|
|
||
| @cache | ||
| def field_types_by_name(self, catalog: CatalogName) -> Mapping[str, FieldType]: |
There was a problem hiding this comment.
TODO: rename method to mapped_field_types(), use FieldName instead of str.
| 'is_tissue_atlas_project': any(bionetwork.atlas_project | ||
| for bionetwork in project.bionetworks), | ||
| 'tissue_atlas': list(map(self._tissue_atlas, project.bionetworks)), | ||
| # We deduplicate the `tissue_atlas` field values since duplicate |
There was a problem hiding this comment.
TODO: replace custom deduplication with an existing mechanism, and investigate why this is necessary since a SetOfDictAccumulator is used during aggregation.
| else: | ||
| return str(term_key) | ||
|
|
||
| if isinstance(field_type, Nested): |
There was a problem hiding this comment.
TODO: Rename nested_keys to nested_property_names. Remove nested_keys param from inner function choose_entry(). Populate nested_property_names with field_type.properties instead of agg['myTerms']['meta']['paths']
| type='terms') | ||
|
|
||
| def make_facets(self, aggs: JSON) -> dict[str, Terms]: | ||
| field_types = self.service.field_types_by_name(self.catalog) |
| agg.aggs.nested.bucket(name='myTerms', | ||
| agg_type='multi_terms', | ||
| terms=[ | ||
| {'field': path + f'.{field}.keyword'} |
There was a problem hiding this comment.
TODO: use dotted()
| if isinstance(agg, Terms): | ||
| # A Terms agg is for a single field, so we only put one | ||
| # field path in `paths`. | ||
| path = agg.field.removesuffix('.keyword').split('.') |
There was a problem hiding this comment.
TODO: Remove `.removesuffix('.keyword')`. Instead, split first with an existing method, assert that the last element is 'keyword', and remove it.
| # these paths to the values in the aggregation buckets. | ||
| agg.meta['paths'] = [] | ||
| for term in agg.terms: | ||
| path = term['field'].removesuffix('.keyword').split('.') |
There was a problem hiding this comment.
TODO: remove duplication between if/else branches.
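One way the two TODOs above could be addressed together is a single helper that both branches call. The sketch below is an assumption, not the eventual change: `field_path` and `agg_paths` are hypothetical names, and plain `str.split` stands in for whatever existing splitting method the reviewer has in mind.

```python
def field_path(field: str) -> tuple[str, ...]:
    """
    Split a dotted field name into its path elements and drop the trailing
    'keyword' sub-field, asserting that it is actually there.
    """
    *path, last = field.split('.')
    assert last == 'keyword', field
    return tuple(path)


def agg_paths(fields: list[str]) -> list[tuple[str, ...]]:
    # A terms aggregation contributes one field, a multi-terms aggregation
    # one field per term, so both cases reduce to a list of fields.
    return [field_path(field) for field in fields]
```

With such a helper, the Terms branch would pass `[agg.field]` and the multi-terms branch `[term['field'] for term in agg.terms]`, leaving no duplicated string handling.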
0910a5e to f18e8bc
There was a problem hiding this comment.
Index: src/azul/indexer/document_service.py
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/src/azul/indexer/document_service.py b/src/azul/indexer/document_service.py
--- a/src/azul/indexer/document_service.py (revision c7ffa182dfc0426054dece0e6a841710c58ad5e3)
+++ b/src/azul/indexer/document_service.py (date 1779471299837)
@@ -126,12 +126,10 @@
@cache
def mapped_field_types(self, catalog: CatalogName) -> Mapping[FieldName, FieldType]:
"""
- Returns the field type for each supported sort and filter field, using
- the name of the field as provided by clients. Unlike field_types(), this
- is a flat mapping and includes the synthetic field 'accessible' that has
- no entry in the plugin's field_mapping.
-
- :return: a mapping from each field's name to its type
+ Returns the field type for each supported sort and filter field, keyed
+ to the name of the field as provided by clients. Unlike field_types(),
+ this is a flat mapping and includes synthetic fields like 'accessible'
+ that lack an entry in the plugin's field_mapping.
"""
plugin = self.metadata_plugin(catalog)
result = {}
| @@ -224,19 +223,4 @@ def _filter_schema_validator(self, | |||
|
|
|||
| @cache | |||
There was a problem hiding this comment.
Don't cache twice. Make a full pass over your changes to ensure that this problem doesn't occur anywhere else.
There was a problem hiding this comment.
I can also only see one callsite, so you may be able to inline this method.
| if isinstance(field_types, Nested): | ||
| element = next(elements, None) | ||
| if element is not None: | ||
| assert element == field_types.agg_property, (element, field_types) |
There was a problem hiding this comment.
Why did you remove this assertion?
There was a problem hiding this comment.
My PR removes the agg_property property from Nested objects. Previously this property was populated with the first key in the nested dictionary (e.g. 'atlas'), and used in the nested aggregation path (e.g. path = dotted(facet_path, agg_property, 'keyword')).
With my changes, the nested aggregation now uses only the facet path (e.g. 'contents.projects.tissue_atlas'), and a multi-terms aggregation is nested inside which contains the path of each nested field property (e.g. 'contents.projects.tissue_atlas.atlas.keyword' and 'contents.projects.tissue_atlas.version.keyword').
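To make the shape described above concrete, here is a rough sketch of the request body as a plain dictionary. It is an approximation built only from the names mentioned in this thread (`myTerms`, `meta['paths']`, the `tissue_atlas` paths); `nested_multi_terms_agg` is a hypothetical helper and the PR itself builds the aggregation through its own aggregation objects rather than raw dicts.

```python
from typing import Any


def nested_multi_terms_agg(facet_path: str,
                           properties: list[str],
                           size: int) -> dict[str, Any]:
    """
    Build a nested aggregation on the facet path that wraps a multi-terms
    aggregation over the keyword sub-field of every nested property.
    """
    paths = [f'{facet_path}.{name}' for name in properties]
    return {
        'nested': {'path': facet_path},
        'aggs': {
            'myTerms': {
                'multi_terms': {
                    'terms': [{'field': f'{path}.keyword'} for path in paths],
                    'size': size
                },
                # Remember which path produced which element of a bucket key
                'meta': {'paths': [path.split('.') for path in paths]}
            }
        }
    }
```

Called with `'contents.projects.tissue_atlas'` and `['atlas', 'version']`, this yields the two keyword fields quoted above, which is why a single `agg_property` is no longer needed.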
| 'tissue_atlas': list(unique_everseen( | ||
| map(self._tissue_atlas, project.bionetworks), | ||
| key=lambda d: frozenset(d.items()) | ||
| )), |
There was a problem hiding this comment.
This is still semantically different from what the accumulator does. This uses all items of the dict, while the accumulator hard-codes two specific keys. If we ever added an entry to the dict, the implementations would diverge. For Nested fields, you can assume that all entries of the value should be considered for equality between values, so this implementation is actually more general. Please check whether we can generalize SetOfDictAccumulator to behave that way by default if no key is supplied. If that's too involved, file a new issue. In either case, make sure that the code here uses the accumulator instead of reinventing the wheel.
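As a starting point for that generalization, a minimal sketch follows. It is not the actual SetOfDictAccumulator: `DictSetAccumulator` and its `accumulate`/`get` methods are hypothetical, and the sketch assumes that all values in the dicts are hashable. The only idea it illustrates is defaulting the key to all items of the dict when no key function is supplied.

```python
from collections.abc import Callable, Hashable, Mapping
from typing import Any, Optional

DictKey = Callable[[Mapping[str, Any]], Hashable]


class DictSetAccumulator:
    """
    Collect dicts, keeping the first of any group considered equal.
    """

    def __init__(self, key: Optional[DictKey] = None) -> None:
        if key is None:
            # Default: two dicts are equal if all of their items are equal
            key = lambda d: frozenset(d.items())
        self.key = key
        self.values: dict[Hashable, Mapping[str, Any]] = {}

    def accumulate(self, value: Mapping[str, Any]) -> None:
        # setdefault keeps the first value seen for a given key
        self.values.setdefault(self.key(value), value)

    def get(self) -> list[Mapping[str, Any]]:
        # dicts preserve insertion order, so first-seen order is retained
        return list(self.values.values())
```

With the default key, a new entry in the dict automatically takes part in the comparison, which avoids the divergence described above. A caller that needs the current behaviour can still pass an explicit key over two specific entries.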
Implement support for aggregation of a nested field
Linked issues: #7128
Checklist
Author
developissues/<GitHub handle of author>/<issue#>-<slug>1 when the issue title describes a problem, the corresponding PR
title is
Fix:followed by the issue titleAuthor (partiality)
ptag to titles of partial commitspartialor completely resolves all linked issuespartiallabelAuthor (reindex)
rtag to commit title or the changes introduced by this PR will not require reindexing of any deploymentreindex:devor the changes introduced by it will not require reindexing ofdevreindex:anvildevor the changes introduced by it will not require reindexing ofanvildevreindex:anvilprodor the changes introduced by it will not require reindexing ofanvilprodreindex:prodor the changes introduced by it will not require reindexing ofprodreindex:partialand its description documents the specific reindexing procedure fordev,anvildev,anvilprodandprodor requires a full reindex or carries none of the labelsreindex:dev,reindex:anvildev,reindex:anvilprodandreindex:prodAuthor (mirror)
mirror:devor the changes introduced by it will not require mirroring ofdevmirror:anvildevor the changes introduced by it will not require mirroring ofanvildevmirror:anvilprodor the changes introduced by it will not require mirroring ofanvilprodmirror:prodor the changes introduced by it will not require mirroring ofprodmirror:partialand its description documents the specific mirroring procedure fordev,anvildev,anvilprodandprodor requires a full mirroring or carries none of the labelsmirror:dev,mirror:anvildev,mirror:anvilprodandmirror:prodAuthor (API changes)
APIor this PR does not modify a REST APIa(A) tag to commit title for backwards (in)compatible changes or this PR does not modify a REST APIapp.pyor this PR does not modify a REST APIAuthor (upgrading deployments)
make docker_images.jsonand committed the resulting changes or this PR does not modifyazul_docker_images, or any other variables referenced in the definition of that variableutag to commit title or this PR does not require upgrading deploymentsupgradeor does not require upgrading deploymentsdeploy:sharedor does not modifydocker_images.json, and does not require deploying thesharedcomponent for any other reasondeploy:gitlabor does not require deploying thegitlabcomponentdeploy:runneror does not require deploying therunnerimageAuthor (hotfixes)
Ftag to main commit title or this PR does not include permanent fix for a temporary hotfixanvilprodandprod) have temporary hotfixes for any of the issues linked to this PRAuthor (before every review)
develop, squashed fixups from prior reviewsmake requirements_updateor this PR does not modifyDockerfile,environment,requirements*.txt,common.mk,Makefileorenvironment.bootRtag to commit title or this PR does not modifyrequirements*.txtreqsor does not modifyrequirements*.txtmake integration_testpasses in personal deployment or this PR does not modify functionality that could affect the IT outcomePeer reviewer (after approval)
Note that after requesting changes, the PR must be assigned to only the author.
System administrator (after approval)
demoorno demono demono sandboxN reviewslabel is accurateOperator
reindex:…labels andrcommit title tagmirror:…labelsno demodevelopOperator (deploy
.sharedand.gitlabcomponents)_select dev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unusedor this PR is not labeleddeploy:shared_select dev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab applyor this PR is not labeleddeploy:gitlab_select anvildev.shared && CI_COMMIT_REF_NAME=develop make -C terraform/shared apply_keep_unusedor this PR is not labeleddeploy:shared_select anvildev.gitlab && CI_COMMIT_REF_NAME=develop make -C terraform/gitlab applyor this PR is not labeleddeploy:gitlabdeploy:gitlabdeploy:gitlabSystem administrator (post-deploy of
.gitlabcomponent)dev.gitlabare complete or this PR is not labeleddeploy:gitlabanvildev.gitlabare complete or this PR is not labeleddeploy:gitlabOperator (deploy runner image)
_select dev.gitlab && make -C terraform/gitlab/runneror this PR is not labeleddeploy:runner_select anvildev.gitlab && make -C terraform/gitlab/runneror this PR is not labeleddeploy:runnerOperator (sandbox build)
sandboxlabel or PR is labeledno sandboxdevor PR is labeledno sandboxanvildevor PR is labeledno sandboxsandboxdeployment or PR is labeledno sandboxanvilboxdeployment or PR is labeledno sandboxsandboxdeployment or PR is labeledno sandboxanvilboxdeployment or PR is labeledno sandboxsandboxor this PR does not remove catalogs or otherwise causes unreferenced indices insandboxanvilboxor this PR does not remove catalogs or otherwise causes unreferenced indices inanvilboxsandboxor this PR is not labeledreindex:devanvilboxor this PR is not labeledreindex:anvildevsandboxor this PR is not labeledreindex:devanvilboxor this PR is not labeledreindex:anvildevsandboxor this PR is not labeledmirror:devanvilboxor this PR is not labeledmirror:anvildevsandboxor this PR is not labeledmirror:devanvilboxor this PR is not labeledmirror:anvildevOperator (merge the branch)
pif the PR is also labeledpartialOperator (main build)
devanvildevdevdevanvildevanvildev_select dev.shared && make -C terraform/shared applyor this PR is not labeleddeploy:shared_select anvildev.shared && make -C terraform/shared applyor this PR is not labeleddeploy:shareddevanvildevOperator (reindex)
devor this PR is neither labeledreindex:partialnorreindex:devanvildevor this PR is neither labeledreindex:partialnorreindex:anvildevdevor this PR is neither labeledreindex:partialnorreindex:devanvildevor this PR is neither labeledreindex:partialnorreindex:anvildevdevor this PR is neither labeledreindex:partialnorreindex:devanvildevor this PR is neither labeledreindex:partialnorreindex:anvildevdevor this PR does not require reindexingdevanvildevor this PR does not require reindexinganvildevdevor this PR does not require reindexingdevanvildevor this PR does not require reindexinganvildevdevor this PR does not require reindexingdevanvildevor this PR does not require reindexinganvildevdevor this PR does not require reindexingdevdevor this PR does not require reindexingdevdeploy_browserjob in the GitLab pipeline for this PR indevor this PR does not require reindexingdevanvildevor this PR does not require reindexinganvildevdeploy_browserjob in the GitLab pipeline for this PR inanvildevor this PR does not require reindexinganvildevOperator (mirroring)
devor this PR is not labelledmirror:devanvildevor this PR is not labelledmirror:anvildevdevor this PR is not labelledmirror:devanvildevor this PR is not labelledmirror:anvildevdevor this PR is not labelledmirror:devanvildevor this PR is not labelledmirror:anvildevOperator
deploy:shared,deploy:gitlab,deploy:runner,API,reindex:partial,reindex:anvilprod,reindex:prod,mirror:partial,mirror:anvilprodandmirror:prodlabels to the next promotion PRs or this PR carries none of these labelsdeploy:shared,deploy:gitlab,deploy:runner,API,reindex:partial,reindex:anvilprod,reindex:prod,mirror:partial,mirror:anvilprodandmirror:prodlabels, from the description of this PR to that of the next promotion PRs or this PR carries none of these labelsShorthand for review comments
L line is too long
W line wrapping is wrong
Q bad quotes
F other formatting problem